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Abstract 

Background: Parsimony and maximum likelihood methods of phylogenetic tree estimation and parsimony methods 
for genome rearrangements are central to the study of genome evolution yet to date they have largely been pursued 
in isolation. 

Results: We present a data structure called a history graph that offers a practical basis for the analysis of genome 
evolution. It conceptually simplifies the study of parsimonious evolutionary histories by representing both 
substitutions and double cut and join (DCJ) rearrangements in the presence of duplications. The problem of 
constructing parsimonious history graphs thus subsumes related maximum parsimony problems in the fields of 
phylogenetic reconstruction and genome rearrangement. We show that tractable functions can be used to define 
upper and lower bounds on the minimum number of substitutions and DCJ rearrangements needed to explain any 
history graph. These bounds become tight for a special type of unambiguous history graph called an ancestral 
variation graph (AVG), which constrains in its combinatorial structure the number of operations required. We finally 
demonstrate that for a given history graph G, a finite set of AVGs describe all parsimonious interpretations of G, and 
this set can be explored with a few sampling moves. 

Conclusion: This theoretical study describes a model in which the inference of genome rearrangements and 
phylogeny can be unified under parsimony. 

Keywords: Genome rearrangement, Phylogenomics, Ancestral reconstruction 



Background 

In genome evolution there are two interacting relation- 
ships between nucleotides of DNA resulting from two key 
features: DNA nucleotides descend from common ances- 
tral nucleotides, and they are covalently linked to other 
nucleotides. In this paper we explore the combination of 
these two relationships in a simple graph model, allowing 
for change by the process of replication, where a complete 
sequence of DNA is copied, by substitution, in which the 
chemical characteristics of a nucleotide are changed, and 
by the coordinated breaking and rematching of covalent 
adjacencies between nucleotides in rearrangement oper- 
ations. These processes have quite different dynamics: 
DNA molecules replicate essentially continuously, much 
more rarely substitutions occur and more rarely still rear- 
rangement operations take place. For this reason, and 
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because of inherent complexity issues, a wealth of models, 
data structures and algorithms have studied these pro- 
cesses either in isolation or in a more limited combination. 

Such evolutionary methods generally start with a set of 
observed sequences in an alignment, an alignment being 
a partitioning of elements in the sequences into equiva- 
lence classes, each of which represents elements that are 
homologous, i.e. that share a recognisably recent common 
ancestor. Though alignments represent an uncertain infer- 
ence, and though there optimisation for standard mod- 
els is intractable for multiple sequences ([1]), we make 
the common assumption that the alignment is given, as 
efficient heuristics exist to compute reasonable genome 
alignments ([2-4]). 

If the sequences in an alignment only differ from one 
another by substitutions and rearrangements that delete 
subsequences, or insert novel subsequences (collectively 
indels), then the alignment data structure is naturally a 2D 
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matrix. In such a matrix, by convention, the rows repre- 
sent the sequences and the columns represent the equiva- 
lence classes of elements. The sequences are interspersed 
with "gap" symbols to indicate where elements are missing 
from a column due to indels. From such a matrix align- 
ment, phylogenetic methods infer a history of replication 
([5]). Such a history is representable as a phylogenetic 
tree, whose internal nodes represent the most recent com- 
mon ancestors (MRCA) of subsets of the input sequences. 
To create a history including the MRCA sequences, addi- 
tional rows can be added to the matrix ([6-8]). Both the 
problem of imputing maximum parsimony phylogenetic 
trees from matrix alignments and calculating maximum 
parsimony MRCA sequences given a phylogenetic tree 
and a matrix alignment are NP-hard ([9,10]). 

In addition to substitutions and short indels, homolo- 
gous recombination operations are a common modifier 
of individual genomes within a population. The align- 
ment of long DNA sequences related by these operations 
is also representable as a matrix. However, the history of 
replication of such an alignment is no longer generally 
representable as a single phylogenetic tree, as each column 
in the matrix may have its own distinct tree. To repre- 
sent the MRCAs of such an alignment requires a more 
complex data structure, termed an ancestral recombina- 
tion graph (ARG) ([11,12]). It is NP-hard under the infi- 
nite sites model (no repeated or overlapping changes) to 
determine the minimum number of homologous recom- 
binations needed to explain the evolutionary history of a 
given set of sequences, and probably NP-hard under more 
general models ([13]). 

Larger DNA sequences, or complete genomes, are often 
permuted by more complex rearrangements, such that the 
matrix alignment representation is insufficient. Instead, 
the alignment naturally forms a graph called a break- 
point graph ([14,15]). Assuming rearrangements are bal- 
anced (neither involving the gain or loss of material), 
inferring parsimonious rearrangement histories between 
two genomes has polynomial or better time complex- 
ity, whether based upon inversions ([16]), translocations 
([17]) or double-cut-and-join (DCJ) operations ([18]). 
However, for three or more genomes with balanced rear- 
rangements ([19]) or when rearrangements are unbal- 
anced (involving the gain or loss of material) leading to 
duplications (additional copies of subsequences resulting 
from rearrangement), these exact parsimony methods are 
intractable. Exact solutions in the most general case are 
therefore only feasible for relatively small problems ([20]) 
before heuristics become necessary ([21,22]). 

Despite the hardness of the general case, there has been 
substantial work on computing maximum parsimony 
results, allowing for a wider repertoire of rearrangements. 
El-Mabrouk studied inversions and indels, though gave 
no exact algorithm for the general case ([23]). Recently 



Yancopoulous ([24]) then Braga ([25]) considered the dis- 
tance between pairs of genomes differing by DCJ opera- 
tions and indels, the latter providing the first linear-time 
algorithm for balanced rearrangements and indels, and 
the former proposing a data-structure to model duplica- 
tions. Many methods have been proposed that deal with 
the combination of rearrangements and duplications, for 
good recent reviews see ([26,27]), however until recently 
there were no algorithms to our knowledge that explic- 
itly unified both duplications and genome rearrangements 
as forms of general unbalanced rearrangement. First [28] 
provided a model allowing for a subset of duplications 
and deletions as well as balanced DCJ operations, giving 
a lower bound approximation, while [29] studied a model 
allowing atomic (single gene) duplications, insertions and 
deletions, but arrived at no closed-form formula for the 
total number of rearrangements. 

The graph model introduced in this paper is capable 
of representing a general evolutionary history for any 
combination of replication, substitution and rearrange- 
ment operations, including duplications and homologous 
recombinations. It therefore generalises phylogenetic 
trees, graphs representing histories with indels, ancestral 
recombination graphs and breakpoint graphs, building 
upon the methods described above. We start by introduc- 
ing this graph and then develop a maximum parsimony 
problem that, somewhat imperfectly, generalises maxi- 
mum parsimony variants of all the problems mentioned, 
facilitating the study of all these subproblems in one uni- 
fied domain. We adopt the common assumption that all 
substitutions and rearrangements occur independently of 
one another, and account for tradeoffs between them 
by independent rearrangement and substitution costs, 
which are themselves essentially sums over the num- 
bers of inferred events. Importantly, replications that are 
combined with unbalanced rearrangements are costed by 
the underlying rearrangement cost. We finally provide 
a bounded sampling approach to cope with the NP- 
hardness of the general maximum parsimony problem. 

Results 

Sequence graphs and threads 

Sequence graphs are used extensively in comparative 
genomics, in rearrangement theory typically under the 
name (multi or master) breakpoint graph ([14,15,22]) and 
in alignment under the name A-bruijn ([30]) or adjacency 
graph ([31]). We use the following bidirected form, which 
is similar to that used by [32] for sequence assembly. 

A (bidirected) sequence graph G = (Vg>Eg) is a graph 
in which a set Vq of vertices are connected by a set Eq of 
bidirected edges ([33]), termed adjacencies, A vertex rep- 
resents a subsequence of DNA termed a segment, A vertex 
x is oriented, having a tail side and a head side, respectively 
denoted x^ead an d Xtaib These categories {head, tail} are 
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called orientations. An adjacency, which represents the 
covalent bond between adjacent nucleotides of DNA, is a 
pair set of sides. We refer to the two sides contained in an 
adjacency as its endpoints. Adjacencies are bidirected, in 
that each endpoint is not just a vertex, but a vertex with 
an independent orientation (either head or tail). For con- 
venience, we say a side is attached if it is contained in an 
adjacency, else it is unattached. By extension, we say a ver- 
tex is attached if either of its sides are attached, else it is 
unattached. 

Associated with a sequence graph is a labeling, i.e. 
a function / : V G -> S* U {0} where £ = 
{A/T, C/G, G/C, T/A] is the alphabet of bases, which are 
oriented, paired nucleotides of DNA, and X* is the set of 
all possible labels consisting of finite sequences of bases 
in X. Bases and labels are directed. For p/r e E, p is the 
forward complement and r is the reverse complement. If 
a vertex is traversed from its tail to its head side, its label 
is read as the sequence of its forward complements. Con- 
versely, if traversed from head to tail, the label is read as 
the reverse sequence of the reverse complements. A vertex 
x £ Vq for which l(x) = 0 is unlabeled. A label represents 
a multibase allele. A path through the sequence there- 
fore represents a single DNA sequence (and its reverse 
complement) whose bases are encoded by the labels of 
the vertices, where unlabeled vertices represent missing 
information. 

A thread is a connected component in a sequence 
graph in which each side is connected to at most one 
adjacency. A thread graph is a sequence graph in which 
every connected component is a thread. In this paper 
we limit ourselves to investigating thread graphs. A 
thread may be a simple cycle, representing a circular 
DNA molecule, or have two unattached sides, in which 
case it represents a linear DNA molecule or fragment 
of a larger DNA molecule. An example thread graph 
is shown in Figure 1. A thread graph is phased, in 



that each thread is assigned a maximal DNA sequence 
(and its reverse complement), and any path though that 
thread corresponds to a subsequence of these maximal 
sequences. In contrast, a sequence graph that is not a 
thread graph may be unphased, in that there exist many 
possible maximal sequences for each of its connected 
components. 

History graphs 

Nucleotides of DNA derive from one another by a process 
of replication. This replication process is represented in 
history graphs, which add ancestry relationships to thread 
graphs. 

A history graph G = (Vg,Eg,Bg) is a thread graph with 
an additional set Bq of directed edges between vertices, 
termed branches. Each vertex is incident with at most one 
incoming branch. The event graph D(G) is the directed 
graph formed by the contraction a of adjacencies in Eq. For 
G to be a history graph D(G) must be a directed acyclic 
graph (DAG), a property we term acyclicity. Example his- 
tory graphs are shown in Figure 2(A,B), along with an 
event graph in Figure 2(C) for the history graph shown 
in Figure 2(B). 

To avoid confusion we define terminology to discuss 
branch relationships. Each weakly connected component 
of branches forms a branch-tree. Two vertices are homol- 
ogous if they are in the same branch-tree. A vertex y 
is a descendant of a vertex x, and conversely y is an 
ancestor of x, if y is reachable by a directed path of 
branches from x. If two homologous vertices do not have 
an ancestor/descendant relationship then they are indi- 
rectly related. For a branch e = (x, y), x is the parent of 
e and y, and y is the child of e and a child of x. Similarly, 
e is the parent branch of y and a child branch of x. A ver- 
tex is a leaf if it has no incident outgoing branches, a root 
if it has no incident incoming branches, else it is internal. 
We reuse the terminology of parent, child, homologous, 



GAGGGTGGCCCGAGAA ) ( V10 VVV1100 VVOVOllVll — | AACCCCAGCACAAATTTT 



GAGGGTGGCCTGAGAA V10VVV1100VV0V011V1 H V M VVOV0000001000VO 



GACGGTGGCCCGAGAA H A TATTCAGAACCTAAAAGTA AACCCCAGCACCAATTTT 



Figure 1 A thread graph. For visual appeal, vertices are the arrow shapes with the sides indicated by the ends of the arrows. Labels within the 
arrows represent the subsequence of DNA when traversed from the tail to the head side of the arrow, and are read as the reverse complement 
when traversed from the head to the tail side. Adjacencies are the lines connecting the ends of the arrow shapes. They are bidirected, i.e. there are 3 
unordered types: head-tail (symmetrically tail-head), tail-tail and head-head adjacencies. In prior illustrations of bidirected graphs ([32]) orientations 
were drawn on the lines, however the semantics of the graph are still the same, in that head and tail orientations are properties of the endpoints of 
the adjacencies, not the vertices. The graph contains three linear threads. As an example, because the middle vertex is attached in the opposite 
direction and therefore reverse-complemented when traversed left-to-right, the top thread represents the sequence "GAGGGTGGCCCGAGAA 
TACTTTAAGGTTCTGAATA AACCCCAGCACAAA^^" (from left-to-right, spaces used to distinguish vertex labels) and its reverse complement, 
'AAAAmGTGCTGGGG^TATOAGAACCTOAAGTATOTCGGGCCACCCTC" (from right-to-left). The colours of the arrows represent homologies 
between the vertices, these are not part of the thread graph itself, but are used in subsequent figures that build on this example. 
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Figure 2 History graphs. (A) A history graph representing homology relationships between the vertices in Figure 1 . Due to space, colours are used 
as labels (and match those in Figure 1 ), with unlabeled vertices shaded grey. Two vertices have the same colour shade if they have identical labels. 
The dotted arrows represent branches. Four ancestral vertices are added relative to Figure 1 to represent the common ancestral vertices of the 
subsets of homologous vertices in Figure 1 . (B) An extension of (A). (C) The event graph for (B). (D) A simple history with four epochs (1 - 4), and 
rearrangements given names corresponding to their type. It is a realisation for the graphs in (A) and (B). 



ancestor, descendant and indirectly related with sides. 
Two sides have a given relationship if their vertices have 
the relationship and they have the same orientation. Sim- 
ilarly, a side is a leaf (resp. root) if its vertex is a leaf 
(resp. root). 

Simple histories 

We formally define a class of history graphs, called simple 
histories, for which parsimonious sequences of substitu- 
tions and rearrangements can be trivially derived. 

A bilayered history graph is a history graph whose 
threads can be partitioned into root and leaf layers, such 
that every branch connects a vertex in the root layer 
with a vertex in the leaf layer. A rearrangement epoch 
is a bilayered history graph in which every branch tree 
is a root with 1 child, every vertex is labeled, and any 



set of homologous sides are either all attached or all 
unattached. For n > 2, an n-way replication epoch is a 
bilayered history graph in which every branch tree is a 
root with n children, every vertex is labeled, any set of 
homologous sides are either all attached or all unattached, 
if two root sides x a and yp are attached by an adja- 
cency then each child of OCqi IS attached to a child of 
yp, and a root vertex has at most one child with a label 
different from its own. An epoch is either a rearrange- 
ment epoch or an n-way replication epoch for some n > 
2. A layered history graph is a history graph that can 
be edge partitioned into a finite sequence of bilayered 
history graphs, such that the leaf layer of a contained 
bilayered history graph is the root layer of the following 
bilayered history graph. A simple history is a layered his- 
tory graph whose bilayered subgraphs are all epochs. An 
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example simple history with epoch subgraphs is shown in 
Figure 2(D). 

A substitution occurs on a branch if the labels of its 
endpoints are not identical Note that a substitution can 
occur either in a rearrangement or a replication epoch. 
The substitution cost of a simple history H is the total 
number of substitutions, denoted 5(H). The example sim- 
ple history in Figure 2(D) has substitution cost 4. Note the 
requirement that all homologous sides in a simple history 
be either all attached or all unattached does not forbid 
rearrangements involving the observed ends of chromo- 
somes (linear threads), because it is always possible to 
add material to a simple history at zero cost that attaches 
such unattached sides and allows them to participate in 
rearrangements. 

The substitution cost defined deals, abstractly, with 
changes of alleles in which any change between alleles is 
scored equally. However for the case £* = E, i.e. single 
base labels, the substitution cost is the minimum num- 
ber of single base changes. Furthermore, any history graph 
in which all homologous labels have the same length can 
easily be converted to a semantically equivalent history 
graph for which E* = E. More complex substitution costs 
to deal with the case where the alphabet represents the 
alleles of genes, as is commonly dealt with in rearrange- 
ment theory, are straightforward but not pursued here for 
simplicity. 

A rearrangement cycle in a rearrangement epoch is a 
circular path consisting of one or more repetitions of the 
basic pattern consisting of an adjacency edge in the root 
layer, a forward branch to the leaf layer, an adjacency edge 
in the leaf layer and a reverse branch to the root layer. Its 
size is the number of repetitions in it of this basic pattern 
minus 1. A linear path that follows this same basic pat- 
tern but does not complete every pattern and return to the 
original vertex is a degenerate rearrangement cycle. Its size 
is the size of the smallest rearrangement cycle that can be 
obtained from it by adding edges. The rearrangement cost 
of a simple history H is the total size of all rearrangement 
cycles in it, denoted r(H). This cost is known to be the 
number of double-cut-and-join (DCJ) operations needed 
to achieve all the rearrangements. 

Lemma L The rearrangement cost of an epoch is the 
minimum number of double-cut-and-join (DC]) opera- 
tions required to convert the root layers adjacencies into 
the leaf layers adjacencies. 

Proof. Similar to that given in [18]. □ 

The example simple history in Figure 2(D) has rear- 
rangement cost 3. 

Because different studies lay different emphases on sub- 
stitution or rearrangement (e.g. because of the available 



data) and because the events do not have the same 
probability in practice, we allow for a degree of freedom 
in the definition of the overall cost function. A (simple his- 
tory) cost function for a simple history is any monotone 
function on the substitution and rearrangement costs in 
which both substitutions and rearrangements have non- 
zero cost. 

Reduction 

Not all history graphs are as detailed as simple histories. 
We define below a partial order relationship that describes 
how one graph can be a generalization of another graph, 
so for example, a less detailed history graph can be used 
to subsume multiple simple histories. 

A branch whose child is unlabeled and unattached is 
referred to as having a free-child. A branch whose parent 
is unlabeled, unattached and a root with a single child is 
referred to as having a free-parent. A vertex is isolated if it 
has no incident adjacencies or branches. 

A reduction operation is an operation upon a history 
graph that either: 

• Deletes an adjacency, an isolated vertex or the label of 
a vertex. 

• Contracts a branch with a free-child or free-parent. 

See Figure 3(A-E) for examples. The inverse of a reduction 
operation is an extension operation. 

Lemma 2. The result of a reduction operation is itself a 
history graph. 

A history graph G is a reduction of another history graph 
G' if G is isomorphic to a graph that can be obtained 
from G' by a sequence of reduction operations, termed a 
reduction sequence. 

Lemma 3. The reduction relation is a partial order. 

We write G A G' to indicate that G is a reduction 
of G' and G < G' to indicate that G is a reduction of 
G' not equal to G ' . Like reduction and extension opera- 
tions, if G is a reduction of G G' is an extension of G. An 
examination of the reduction relation is in the Discussion 
section. 

History graph cost 

Using the parsimony principle, we now extend parsimony 
cost functions, previously defined on simple histories, to 
all history graphs. 

A simple history H that is an extension of a history 
graph G is called a realisation of G. The set H(G) is the 
realisations of G. 



Paten etal. BMC Bioinformatics 2014, 15:206 
http://www.biomedcentral.eom/1 471 -21 05/1 5/206 



Page 6 of 31 





Figure 3 Reduction operations. (A-E) For each case the graph on the left is a reduction of the graph on the right. (A) A label deletion. (B) An 
adjacency deletion. (C) A vertex deletion. (D) A contraction of a branch with a free-child. (E) A contraction of a branch with a free-parent. 



For a given cost function c the cost of a history graph 
Gis b 



C(G,c) = min c(s(H), r(H)). 
HeH(G) 



Lemma 4. The problem of finding the cost of a history 
graph is NP-hard. 

Proof There are parsimony problems on either substi- 
tutions or rearrangements alone that are NP-hard and 
can be formulated as special cases of the problem of 
finding the minimum cost realisation of a history graph 
([9,34]). □ 

The lifted graph 

Although determining the cost of a history graph is NP- 
hard, we will show that the cost can be bounded such that 
the bounds become tight for a broad, characteristic sub- 
set of history graphs. To do this we introduce the concept 
of lifted labels and adjacencies, which are used to project 
information about labels and adjacencies from descendant 
to ancestral vertices and are useful in reasoning about the 
cost of a history graph. 

The free-roots of a history graph G are a set of additional 
vertices such that a single, unique free-root is assigned 
to each root vertex in G (see the top of Figure 4(A)). For 
a vertex x, its lifting ancestor A(x) is the most recent 
labeled ancestor of x, else if no such vertex exists, the 
free-root of the branch-tree containing x. For a side x a its 
lifting ancestor (overloading notation) A(x a ) is its most 
recent attached ancestor, else if no attached ancestor 
exists, its ancestral side in the free root of the branch tree 
containing it. 

For a labeled vertex y, a lifted label is a label identi- 
cal to l(y) on its lifting ancestor. For a vertex the lifted 
labels is therefore a multiset, because the same lifted 
label may be lifted to a lifting ancestor from multiple dis- 
tinct descendants and each is considered an element of 
the multiset. 



For an adjacency {x a ,yp}, a lifted adjacency is a bidi- 
rected edge {A(x a ) } A(yp)}. In analogy with the lifted 
labels for a vertex, the lifted adjacencies for a side is the 
multiset of lifted adjacencies incident with the side. 

A history graph G with free-roots, lifted labels and lifted 
adjacencies is a lifted graph L(G). Figure 4(A) shows an 
example lifted graph that outlines these concepts. 

Some lifted elements do not imply change between 
descendant and ancestral states, while others do. To for- 
malise such a notion we define trivial and non-trivial 
labels and and adjacencies. A lifted label p of a labeled 
vertex x is trivial if l(x) = p. A lifted label p on an 
unlabeled vertex x (necessarily a free root) is trivial if it 
is the only lifted label on x. Otherwise a lifted label is 
non-trivial 

A junction side is a most recent common ancestor 
(MRCA) of two attached, indirectly related sides. For a 
history graph G, a lifted adjacency e = {A(x a ) , A(y p)} 
is trivial if there exists no unattached junction side on 
the path of branches from (but excluding) A(x a ) to (but 
excluding) x a , or on the path of branches from (but 
excluding) A(yp) to (but excluding) yp and either there is 
a (regular) adjacency between A(x a ) and A(yp) in G or 
A(x a ) and A (yp) are free roots, else e is non-trivial. See 
Figure 4(A) for examples of trivial and non-trivial labels 
and adjacencies. 

Ancestral variation graphs 

We can now define a broad class of history graphs for 
which cost can be computed in polynomial time. To do 
this we will define ambiguity, information that is needed 
to allow the tractable assessment of cost. There are two 
types of ambiguity. The substitution ambiguity of a history 
graph G, denoted u s (G), is the total number of non- 
trivial lifted labels in excess of one per vertex. Substitution 
ambiguity reflects uncertainty about MRCA bases. The 
substitution ambiguity of the history graph in Figure 2(B) 
is 1, as there exists one vertex with two non- trivial 
lifted labels. 
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Figure 4 A lifted graph. (A) The lifted graph for the history graph in Figure 2(B). The blue and red lines represent, respectively, trivial and non-trivial 
lifted adjacencies. Similarly, the blue and red stars represent, respectively, trivial and non-trivial lifted labels. The free-roots are shown as a set of 
vertices above the other vertices, with a grey line identifying their matching branch-tree. (B) The module in (A) containing non-trivial lifted edges. 
Lower case letters are used to identify the sides.. 



The rearrangement ambiguity of a history graph G, 
denoted u r (G), is the total number of non-trivial lifted 
adjacency incidences in excess of one per side. 

Rearrangement ambiguity reflects uncertainty about 
MRCA adjacencies. The rearrangement ambiguity of the 
history graph in Figure 2(B) is 5, because two sides have 
three incident non- trivial lifted edges and one side has two 
incident non- trivial lifted edges. 

The ambiguity of a history graph G is u(G) = u s (G) + 
u r (G). An ancestral variation graph (AVG) H is a history 
graph such that u(H) = 0, i.e. an unambiguous history 
graph. 

Lemma 5. Simple histories are AVGs. 

While simple histories are AVGs, so are many other his- 
tory graphs that are far less detailed. For example, the 
AVG in Figure 5 is not a simple history. 

Bounds on cost 

We provide trivially computable lower and upper bound 
cost functions for history graphs that are tight for AVGs. 

The lower bound substitution cost (LBSC) of a his- 
tory graph G, denoted s/(G), is the total number of dis- 
tinct (not counting duplicates in the multiset) nontrivial 
lifted labels at all vertices minus the number of unla- 
beled vertices with non-trivial lifted labels (necessarily 
free roots). The LBSC of the history graph in Figure 2(B) 
is 4. 

The upper bound substitution cost (UBSC) of a history 
graph G, denoted s u (G), is the total number of nontriv- 
ial lifted labels at all vertices minus the maximum number 



of identical lifted labels at each unlabeled vertex with 
non-trivial lifted labels (again, necessarily free roots). The 
UBSC of the history graph in Figure 2(B) is 5. For the AVG 
in Figure 5, LBSC = UBSC = 4. 

The module graph of a history graph G is a multi-graph 
in which the vertices are the sides of vertices in L(G) that 
have incident real or lifted adjacencies and the edges are 
the real and lifted adjacencies in L(G) incident with these 
sides. Each connected component in a module graph is 
called a module. The set of modules in the module graph 
for G is denoted M(G). Figure 4(B) shows the modules for 
Figure 4(A). 

The lower bound rearrangement cost (LBRC) for a his- 
tory graph G is: 

r l( G)= £ (r^l-1). 

MeM(G) 

For a history graph that is a simple history this definition 
is equivalent to the earlier definition of rearrangement 
cost for simple histories. 

The upper bound rearrangement cost (UBRC) of a his- 
tory graph G, denoted r M (G), is the total number of 
non- trivial lifted adjacencies in L(G) minus the number 
of modules in M(G) in which every side has exactly one 
incident non-trivial lifted edge. The LBRC of the history 
graph in Figure 2(B) is 3 and its UBRC is 6. For the AVG 
in Figure 5 LBRC = UBRC = 3. 

Theorem 1. For any history graph G and any cost func- 
tion c, c(si(G),ri(G)) < C(G,c) < c(s u (G), r u (G)) with 
equality ifG is an AVG. 
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Figure 5 The lifted graph for an AVG with (simple) modules containing non-trivial lifted adjacencies highlighted. Uses the same notation as 
in Figure 4(A). 



The proof is in given the Methods section. 

Theorem 1 demonstrates that LBSC and LBRC are lower 
bounds on cost, UBSC and UCRC are upper bounds on 
cost, and that all these bounds become tight at the point 
of zero ambiguity. This implies that to assess cost of an 
arbitrary history graph G we need only search for exten- 
sions of G to the point that they have zero ambiguity and 
not the complete set of simple history realisations of G. 
For an AVG H, as the lower and upper bounds on cost 
are equivalent, we write r(H) = ri(H) = r u (H) and 
s(H)= Sl (H)=s u (H). 

G-optimal AVGs 

We now explore the process of sampling AVG extensions 
of an initial starting graph. Though it is possible to start 
from any history graph, in practice we are likely to start 
from a history graph G based on sequence alignments, 
such as that shown in Figure 2(A). If G is already an AVG, 
by Theorem 1, it is trivial to assess its cost. If not we 
sample AVG extensions of G in order to assess cost and 
explore the set of most parsimonious realisations of G. 
With the aim of restricting this search, ultimately to a 
finite space, we first define the set of G-optimal AVGs. 

An AVG extension H of a history graph G is G- 
parsimonious w.r.t. a cost function c if C(G, c) = 
c(s(H),r(H)). The set of G-parsimonious AVGs is neces- 
sarily infinite: it is always possible to add arbitrary vertices 
without affecting substitution or rearrangement costs. To 
avoid the redundant sampling of AVG extensions of G and 
their own extensions we define the notion of minimality. 



An AVG extension H of G is G-minimal if there is 
no other AVG H' such that G < H' < H. The set of 
G-minimal AVGs contains those AVGs that can not be 
reduced without either ceasing to be AVGs or extensions 
of G. This set is also infinite for some DNA history graphs 
(Lemma 9 below). 

An AVG is G-optimal w.r.t. a cost function c if it is both 
G-parsimonious w.r.t. to c and G-minimal. We establish 
below that the set of G-optimal AVGs is finite for any his- 
tory graph G. By definition, any G-parsimonious AVG is 
either G-minimal or has a G-minimal reduction therefore 
we can implicitly represent and explore the set of parsi- 
monious realisations of G by sampling just the G-optimal 
AVGs. 



G-bounded history graphs 

Unfortunately, because the history graph cost problem is 
NP-hard, it is unlikely that there exists an efficient way 
to sample only G-optimal. Instead, we now define a finite 
bounding set that contains G-optimal and can be effi- 
ciently searched. Conveniently this bounding set is the 
same for all cost functions. 

A label of a vertex x is a junction (overloading the term 
junction, but using it analogously) if x has more than one 
lifted label, else it is a bridge if x has one lifted label, its 
lifted label is non-trivial, the most recent labeled ancestor 
of x is labeled the same as x and this ancestor has at least 
one non-trivial lifted label (see Figure 6(A,B)). 

A side Xq IS a bridge side if it is not a junction, is incident 
with one non-trivial lifted adjacency and an adjacency e 
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that defines a trivial lifted adjacency e' whose A(x a ) end- 
point is a junction side incident with a non- trivial lifted 
adjacency, and such that if e is deleted at least one end- 
point of e' in the original graph remains a junction side in 
the resulting graph (see Figure 6(C,D)). An adjacency is a 
junction (again, overloading the term junction) if either of 
its endpoints are junctions, else it is a bridge (overloading 
bridge) if either of its endpoints are bridge sides. 

An element is non-minimal if it is a branch with a 
free-child or free-parent, an isolated vertex, or label or 
adjacency that is not a junction or bridge. 

For G A G f , an element in G' is G -reducible if there 
exists a reduction operation in a reduction sequence from 
G' to G that either deletes the element if it is an adjacency, 
label or vertex or contracts it if it is a branch. We are inter- 
ested in the set of G-reducible elements of an extension 
of G, as they are the elements which may be added and 
removed during an iterative sampling procedure. 

For G A G the G-unbridged graph of G' is the reduc- 
tion resulting from the deletion of all G-reducible bridge 
adjacencies in G A side x a that has no attached descen- 
dants is a hanging side, A pair of adjacencies e and e f , each 
with a hanging side, and such that e has an endpoint whose 
most recent attached ancestor is incident with e' ', form a 
pair of ping-pong adjacencies. We call e the ping adjacency 
and e' the pong adjacency (Figure 6(E)). 

A history graph G' is G-bounded if it is an extension of G 
that does not contain a G-reducible non-minimal element 
and its G-unbridged graph does not contain a G-reducible 
ping adjacency. 

Theorem 2. The set of G-bounded AVGs contains the G- 
optimal AVGs for every cost function. 

The proof is given in the Methods section. 

Importantly, the following theorem demonstrates that 
there is a constant k such that any G-bounded history 
graph is at most k times the cardinality of G. 

Theorem 3. A G-bounded history graph contains less 
than or equal to max(0, lOn — 8) G-reducible adjacencies 



and max(0, 2m — 2, 20n — 16, 20n + 2m — 18) additional 
vertices, where n is the number of adjacencies in G and m 
is the number of labeled vertices in G. This bound is tight 
for all values ofn and m. 

The proof is given in the Methods section. 
The set of G-bounded history graphs and, by inclusion, 
the set of G-optimal AVGs are therefore finite. 

The G-bounded poset 

Finally we demonstrate how to navigate between G- 
bounded history graphs using a characteristic set of oper- 
ations that define a hierarchy between these graphs. 

For a vertex x in a G-bounded history graph the compos- 
ite minimisation of x is as follows: 

• If x is unattached and unlabeled and has a G-reducible 
parent branch, the contraction of the parent branch, 
renaming the resulting merged vertex x. 

• If x is then an unattached, unlabeled root and has a 
single G-reducible child branch, the contraction of 
the child branch, renaming the resulting merged 
vertex x. 

• The deletion of x if subsequently isolated, unlabeled 
and G-reducible. 

A G-bounded reduction operation on a G-bounded his- 
tory graph is one of the following operations, provided it 
results in a G-bounded history graph. 

• a label detachment: the deletion of a G-reducible 
label on a vertex x, followed by the composite 
minimisation of x (Figure 7(A-C)). 

• an adjacency detachment: the deletion of a 
G-reducible adjacency {x ai yp} followed by the 
composite minimisation of x and y (Figure 7(D-F)). 
The inverse of an adjacency detachment is an 
adjacency attachment 

• a lateral-adjacency detachment: the adjacency 
detachment of a pair of G-reducible junction 
adjacencies {x at yp} and {A(x a ),A(yp)} t and a 
subsequent adjacency attachment that creates an 
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adjacency that includes x a or yp as an endpoint 
(Figure 7(D-E)). 

Note that the first two G-bounded reduction opera- 
tions are combinations of reduction operations, while the 
lateral-adjacency detachment, which proves necessary to 
avoid creating intermediate graphs with G-reducible ping- 
pong edges, involves both reduction and extension opera- 
tions, but always reduces the total number of adjacencies. 
As with reduction operations, the inverse of a G-bounded 
reduction operation is a G-bounded extension operation. 
A G-bounded history graph G' is a G-bounded reduction 
(resp. extension) of another G-bounded history graph G" 
if G' is isomorphic to a graph that can be obtained from G" 
by a sequence of G-bounded reduction (resp. extension) 
operations. 

Lemma 6. The G-bounded reduction relation is a par- 
tial order 

The G-bounded poset is the set of G-bounded his- 
tory graphs with the G-bounded reduction relation. We 
write <q to denote the G-bounded reduction relation and 
-< 'G to denote its covering relation (i.e. A < >qB iff 
A <q B and there exists no C such that A <q C <q B). 

Theorem 4. The G-bounded poset is finite, has a sin- 
gle least element G, and its maximal elements are all 
AVGs. Also, G' < -qG" iff there exists a single G-bounded 
reduction operation that transforms G" into G 

The proof is given in the Methods section. 



As the G-bounded poset is finite, it can be represented 
by a Hasse diagram whose nodes are the G-bounded 
history graphs and whose edges, which are the cover- 
ing relation, represent equivalence classes of G-bounded 
operations. Figure 8 shows a simple G-bounded poset 
Hasse diagram. 

A basic implementation 

The previous four theorems establish the mechanics of 
everything we need to sample the finite set of G-optimal 
AVGs, and thus, amongst other things, determine the cost 
of a history graph. Although it will require further work 
to establish practical and efficient sampling algorithms, 
we have implemented a simple graph library in Python 
that for an input history graph G iteratively generates 
G-bounded AVGs (https://github.com/dzerbino/pyAVG) 
through sequences of G-bounded extension operations. 

To test the library we used simulations. For each simu- 
lation we generated a simple history H by forward simu- 
lation, starting from a genome with 5 vertices in a single 
thread and simulating 4 epochs in which either whole 
chromosome replication or rearrangements occurred and 
substitutions were made at a constant rate at each branch. 
The labels in the simulation correspond to single DNA 
bases. To ensure complexity, we selected histories with 
substitutions, rearrangements and at least two epochs of 
replication. We created a reduction G of H by removing 
from H all labels of internal vertices and adjacencies inci- 
dent on internal vertices and finally contracting the parent 
branch of all internal vertices. As a result, the reduced 
history contained only the leaf threads and branch trees 
that, containing no internal vertices, simply indicate the 
homologies between the vertices. To simulate incomplete 
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Figure 8 A Hasse diagram of the G-bounded poset for an example history graph. The graph (A) is G, the graphs in (A-F) represent the 
G-bounded extensions. 



genome assemblies, we randomly removed, on average, 
10% of the adjacencies, labels and vertices from these leaf 
threads. To test our library we enumerated sequences of 
G-bounded history graphs starting at G, at each step pick- 
ing at random a possible G-bounded extension operation 
until we reached an AVG. We sampled 20,000 starts for 
each of 20 randomly sampled pairs of history and starting 
graph. To make the search strategy efficient, we restarted 
the search if we reached an extension with a higher total 
sum of lower bound substitution and rearrangement costs 
than s u (G) + r M (G), initially, and then subsequently the 
sum of the substitution and rearrangement costs of the 
best AVG found up to that point. Tables 1 and 2 show 
the results of these 20 sampling runs. Figure 9 shows one 
example of H, G and a sampled AVG. 

For these simulations the minimum rearrangement cost 
of any sampled AVG is often close or equal to r/(G), 
while the maximum rearrangement cost of any sampled 
AVG is generally slightly greater than r u (G). Notably, we 
found that AVG extensions sometimes had lower cost 
than the original simple history, this occurring because of 
the information loss that resulted from reducing H to G. 

Repeating these experiments with histories that started 
with 10 root vertices in the simple history, but which 



were otherwise simulated identically, demonstrates that 
the naive random search procedure implemented here 
fails to find reasonable histories within a set of only 
20,000 random samples (data not shown), so, as might 
be expected, more intelligent sampling strategies will be 
needed to find parsimonious interpretations of even mod- 
erately complex datasets. However, with more efficient 
sampling algorithms, a history graph sampling algorithm 
could be applied to find solutions to various established 
parsimony problems, such as the DCJ median problem, or 
be used for less explored problems, such as the inference 
of gene trees incorporating synteny information. 

Discussion 

We have introduced a general model for genome evolu- 
tion under parsimony, but the reduction relation and the 
definition of the G-bounded set may appear arbitrary. We 
highlight below the reasons for our choice of reduction 
relation, how reduction relates to other orderings over 
graphs, and how we can easily approximate a set of G- 
reducible elements, something critical to the sampling of 
G-bounded extensions of a given graph. We then briefly 
discuss the possibilities of yet more compact graphical 
representations. 
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Table 1 Simulation results assessing substitution 
ambiguity and cost 
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Each row represents a separate initial history. The cost s(H) is the substitution 
cost of the simple history from which G is derived. Also given are the ambiguity 
u s (G), lower s/(G), and upper s u (G) substitution cost bounds for G. The minimum 
and maximum substitution costs of G-bounded AVG extensions found by 
sampling are denoted s(H smin ) and s(H smax ), resp. 



Table 2 Simulation results assessing rearrangement 
ambiguity and cost 
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Each row represents a separate initial history. The cost r(H) is the rearrangement 
cost of the simple history from which G is derived. Also given are the ambiguity 
u r (G), lower r/(G), and upper r u (G) rearrangement cost bounds for G. The 
minimum and maximum rearrangement costs of G-bounded AVG extensions 
found by sampling are denoted r(H rmin ) and r(H rmax ), resp. 



In the reduction relation, we allow the deletion of ver- 
tices, vertex labels and adjacencies, but forbid branch 
deletion. Otherwise, extensions would allow the inven- 
tion of homology between vertices (see Figure 10(A)). 
Conversely, branches can be contracted but not adjacen- 
cies, otherwise extensions could create interstitial vertices 
without any rearrangement (see Figure 10(B)). 

We disallow the non-trivial contraction of the incom- 
ing branch of attached or labeled vertices, with the 
one exception for branches with free-parents, because 
it would allow a reduction to merge previously separate 
threads (see Figure 10(C)), and because vertices could 
be reduced to become ancestors of originally indirectly 
related vertices (see Figure 10(D)). We allow the one 
exception for the contraction of the incoming branch of 
attached or labeled vertices when the branch has a free- 
parent because disallowing it would forbid reductions that 
removed information from root vertices (see Figure 10(E)) 
and allowing it does not permit the issues highlighted in 
Figures 10(C-D). 

It is informative to consider the relationship between 
reduction operations and the reduction relation. When a 



graph contains multiple copies of isomorphic structures, 
distinct reduction operations can result in isomorphic 
reductions (see Figure 10(F-I)), therefore each possible 
reduction in the covering set (transitive reduction) of 
the reduction relation represents an equivalence class of 
reduction operations. 

A valid permutation of a reduction sequence is a permu- 
tation in which all operations remain reduction operations 
when performed in sequence. Clearly not all permutations 
of a reduction sequence have this property, however the 
following lemma illustrates the relationship between valid 
permutations. 

Lemma 7. All valid permutations of a reduction 
sequence create isomorphic reductions. 

Reduction is somewhat analogous to a restricted form 
of the graph minor. Importantly, the graph minor is a well- 
quasi-ordering (WQO) ([35]), i.e. in any infinite set of 
graphs there exists a pair such that one is the minor of the 
other. 
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Figure 9 History graph examples generated by simulation. (A) H, (B) G, (C) An example of H rm j n and H S m/n- Example corresponds to experiment 
1 in Tables 1 and 2. The G-bounded extension sequence from G to this AVG involved the creation of just 7 adjacencies, 5 vertices and 7 labels. Graph 
layouts were computationally derived. 



Lemma 8. Reduction is not a WQO. 

Proof. Consider the infinite set of cyclic threads, they 
are not reductions of one another. □ 

An ordering is a WQO if every set has a finite subset 
of minimal elements. In contrast, it can be shown that for 
the reduction relation, even the set of AVG extensions of 
a single base history G can have an infinite set of minimal 
elements. 

Lemma 9. There exists a history graph G with an infinite 
number of G-minimal extensions. 

The proof is given in the Methods section. 



One barrier to exploring the G-bounded poset is decid- 
ing for a pair of history graphs G and G' such that G A G' 
if an element is G-reducible. This problem is of unknown 
complexity, and may well be NP-hard. To avoid the poten- 
tial complexity of this problem we can define an alterna- 
tive notion of reducibility. A fix for (G, G f ), where G A G' , 
is a history subgraph of (Vc? , E& , B^,) isomorphic to G, 
where B J, is the transitive closure of Bq> . Starting from 
an input history graph G and a fix isomorphic to it, we 
can easily update the fix as we create extensions of G. For 
an extension of G, elements in the fix become the equiva- 
lent of G-irreducible, while elements not in the fix become 
the equivalent of G-reducible. From a starting graph we 
can therefore explore a completely analogous version of 
G-bounded, replacing the question of G-reducibilty with 
membership of the fix. 
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Figure 10 Justifying the reduction relation. (A,B,C,D) The graphs on the left side are not reductions of the graphs on the right. (E) The graph on 
the left is a reduction of the graph on the right. (F,G,H,I) Examples of equivalence classes of reduction operations, where multiple distinct reduction 
operations result in the same reduction. 



Following from Lemma 7, there is a bijection between 
the set of fixes for G A G' and the set of equivalence 
classes of reduction sequences that are all valid permuta- 
tions of each other. This is the limitation of considering 
membership of a fix instead of assessing if an element 
is G-reducible, it limits us to considering only a single 
equivalence class of reduction sequences in exploring the 
analogous poset to G-bounded. 

It is in general possible to reduce the size of the set G- 
bounded while still maintaining the properties that it can 
be efficiently sampled and contains G-optimal. However, 
this is likely to be at the expense of making the defini- 
tion of G-bounded more complex. One approach is to add 
further "forbidden configurations" to the definition of G- 
bounded, like the G-reducible ping adjacencies that are 
forbidden in the current definition of G-bounded. Forbid- 
ding these was essential to making G-bounded finite, but 
we might consider also forbidding other configurations 
just to make G-bounded smaller. 

It is possible to consider a graph representation of his- 
tories that use fewer vertex nodes if we are willing to 
allow for the possibility that a subrange of the sequence 
of a vertex be ancestral to a subrange of the sequence 
of another vertex. This is a common approach in ances- 
tral recombination graphs ([11]). Such a representation 
entails the additional complexity of needing to specify the 
sequence subranges for every branch, but may in some 
applications be a worthwhile trade off for reducing the 
number of vertices in the graph. The theory of such graphs 



is mathematically equivalent to the theory of the his- 
tory graphs presented here, but the implementation would 
differ. 

Conclusion 

We have introduced a graph model in which a set of 
chromosomes evolves via the processes of whole chro- 
mosome replication, gain and loss, substitution and DCJ 
rearrangements. We have demonstrated upper and lower 
bounds on maximum parsimony cost that are trivial to 
compute despite the intractability of the underlying prob- 
lem. Though these cost bounding functions are relatively 
crude and can almost certainly be tightened for many 
cases, they become tight for AVGs. This implies that we 
only need to reach AVG extensions to assess cost when 
sampling extensions. 

To our knowledge, this is the first fully general model 
of chromosome evolution by substitution, replication, and 
rearrangement. However, it has its limitations. For exam- 
ple, it treats common rearrangements, such as recom- 
binations and indels as any other rearrangement, and 
only takes into account maximum parsimony evolutionary 
histories. We anticipate future extensions that incorpo- 
rate more nuanced cost functions, as well as probabilistic 
models over all possible histories. 

The constructive definition of the G-bounded poset, 
coupled with the upper and lower bound functions, sug- 
gests simple branch and bound based sampling algorithms 
for exploring low-cost genome histories. To facilitate the 
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practical exploration of the space of optimal and near 
optimal genome histories, we expect that more advanced 
sampling strategies across the G-bounded poset could be 
devised. 

Methods 

Proof of Theorem 1 

We first define some convenient notations to describe 
lifted labels and edges. For a vertex x let L x = (L X ,N X ) 
be its multiset of lifted labels, where L(x) is the set of dis- 
tinct lifted labels for x, and for each lifted label p, N x (p) 
is the number of times p appears as a lifted label for x, i.e. 
L x = {l(y) : A{y) = x} c E* and N x : L x -> Z + such that 
N x (p) = \{y:A(y)=x,l(y) = p}\. 

For a side x a , and overloading notation, let L Xa = 
(L Xa ,N Xa ) be its multiset of lifted edges, where L(x a ) is the 
set of distinct lifted adjacencies incident with x a , and for 
each lifted adjacency {x a ,w y }, N Xa ({x a , w y }) is the num- 
ber of sides whose lifting ancestor is x a , and which are 
connected by an adjacency to a side whose lifting ancestor 
is w y , i.e. L Xa = {{x a = A(y a ),A(zp)} : \y a ,zp) e E G ) and 
N Xa = L Xa -> Z + such that N Xa ({x a ,w y }) = \{y a : {x a = 
A(y a ),w y } e L X J\. 

Note that for a side x a , N Xa ({x a ,w y }) gives the mul- 
tiplicity of lifted adjacency incidences with x at not the 
multiplicity of {x a , w y }. In particular, if two sides x a and 
xf a are attached and share the same lifting ancestor A (x a ), 
then NA( Xa )({A(x a ),A(x a )}) is incremented by 2. On the 
contrary, if x a is connected to w y and A(x a ) is dis- 
tinct from A(w y ), then both NA(x a )({A(x a ),A(Wy)}) and 
NA(w Y) ({A(x a ),A(w y )}) are incremented by 1. 

For a vertex (resp. side) x the multi-set of non-trivial 
lifted labels (adjacencies) is L x — (L X ,N X ) c L x . 

The equivalence ofLBSC to UBSC and LBRC to UBRC forAVGs 
Lemma 10. For any AVG H, s t (H) = s u (H). 

Proof For a vertex x without substitution ambiguity 
there is at most one non-trivial lifted label, that, if it exists, 
has a multiplicity of one, therefore \L X \ = \L X \ = 0 or 1. 
Let 8 at b be the Kronecker delta, i.e. 8 at b = 1 if a = b, else 
0. It is easily verified for every possible case: 

max(0, \L X \ - 8 Kx))0 ) = \L X \ - 8 Kx))0 x maxA^(p), 

P eV x 

summing over modules, therefore: 

si(H) = ^ max(0, \L X \ - 8 Kx))0 ) 
xeV UH ) 

= \L'x\ 8 Kx),0 x mzxN x (p) = s u (H). 



A module is simple if each side has at most one inci- 
dence with a non- trivial lifted adjacency. 

Lemma 11. All modules in an AVG are simple. 

Proof. Follows from definition of rearrangement ambi- 
guity. □ 

Lemma 12. For an AVG H, r t (H) = r u (H). 

Proof. Let M be a simple module and let I<m = 
^2x a eVM^i\V \' he * tne num t )er °f sides in Vm with a 
single incidence with a non-trivial lift. 

As the module is simple it is a path or a cycle, and hence 
I Vm\ — I<M = 0, 1 or 2, from which it is easily verified that: 

r 2 1_1 = - 11 8 W Xa \' 

Xa^VM 

Summing over modules in H, which are all simple, 
therefore: 

E ri^f^i-i- E - n Kks 



MeM(H) 



MeM(H) x a eV M 



As all modules of H are simple, I<m is always even and 
! <M = J2 Xa eV M \ l 'x a \> therefore: 

MeM(H) 7 x a eV M MeM(H) x a eV M 

therefore, for an AVG H 



MeM(H) ^ ' 

MeM(H) \ x a eVM x a eV M / 



xeV L (H) 



P eL' x 



□ 



□ 



A bounded transformation of a history graph into an AVG 

In this section we will prove that any history graph G 
has an AVG extension H such that s u (G) > s u (H) 
and r u (G) > r u (H). To do this we define sequences of 
extension operations that when applied iteratively and 
exhaustively construct such an extension. 

A vertex or side x is ambiguous if \L X \ > 1. For 
an ambiguous free-root x' and unlabeled root vertex x 
such that A(x) = xf, a root labeling extension is a 
labeling of x with a member of the set arg maxA/^/ (p) 

(See Figure 11(A)). 
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Figure 1 1 Extension operations used to demonstrate the upper bound. (A) A root labeling extension. (B) A substitution ambiguity reducing 
extension. (C-D) Junction side attachment extensions. (E) A rearrangement ambiguity reducing extensions. Elements in red/outlined in red are 
those added in extension. 



Lemma 13. For any history graph G containing an 
ambiguous free-root there exists a root labeling extension 
G' of G such that s u (G) = s u (G r ), r u (G) = r u (G r ) and 
u(G) > u{G'\ 

For a branch (x,x f ) an interpolation is the extension 
resulting from the creation of a new vertex x" and 
branches (x,x") and (xf',xf) and the deletion of (x,x r ). Let 
x be a labeled and ambiguous vertex and x' be a labeled 
vertex such that A(x f ) — x and l(x) ^ l(x f ). A substitu- 
tion ambiguity reducing extension is the interpolation of a 
vertex x" along the parent branch of x' labeled with l{x) 
(See Figure 11(B)). 

Lemma 14. For any history graph G containing no 
ambiguous free-roots and such that u s (G) > 0, there exists 
a substitution ambiguity reducing extension G' of G such 
thats u {G) = s u {G'\ r u (G) = r u {G') and u(G) > u(G f ). 

The following is used for eliminating rearrangement 
ambiguity. For an unattached junction side x a a junction 
side attachment extension is the extension resulting from 
the following: If x a has no attached ancestor, the cre- 
ation of a new vertex and adjacency connecting a side of 
the new vertex to x a (see Figure 11(C) for an example), 
else {A(x a )>yp} € Eg and the extension is the creation 



of a new vertex /, branch (j,/) and adjacency [x ai y'p] 
(See Figure 11(D)). 

Lemma 15. For any history graph G containing an 
unattached junction side, there exists a junction side 
attachment extension G' of G such that s u (G) — s u (G'), 
fu(G) > r u (G'), u(G) > u(G') and G' contains one less 
unattached junction side than G. 

Let {x at yp} and {A(x a ),z y } be a pair of adjacencies and 
A(x a ) be ambiguous. A rearrangement ambiguity reducing 
extension is the interpolation along the parent branch of 
x a vertex x f , the creation of a new vertex z' new branch 
(z,z f ) and new adjacency {x' a ,z' y } (See Figure 11(E)). 

Lemma 16. For any history graph G containing no 
unattached junction sides and such that u r (G) > 0, there 
exists a rearrangement ambiguity reducing extension G' of 
G such that s u (G) = s u (G f ), r u (G) > r u (G f ) and u(G) > 
u{G'\ 

We can now prove the desired lemma. 

Lemma 17. Any history graph G has an AVG extension 
H such that s u (G) > s u (H) and r u (G) > r u (H). 



Paten etal. BMC Bioinformatics 2014, 15:206 
http://www.biomedcentral.eom/1 471 -21 05/1 5/206 



Page 17 of 31 



Proof. Using the previous 4 lemmas it is easily verified 
the result of the following algorithm is an AVG extension 
H for a history graph G such that s u (G) > s u (H) and 
r u (G)>r u (H). 



H ^ G 

while u(H) > 0 do 

if H contains an ambiguous free-root then 

H <— root labeling extension of H. 
else 

if u s (H) > Othen 

H <- substitution ambiguity reducing extension 

ofH. 
else 

if H contains an unattached junction side then 
H <- junction side attachment extension of 
H. 

else 

H <— rearrangement ambiguity reducing 
extension of H. 
end if 
end if 
end if 
end while 



□ 

A bounded transformation of an AVG into a realisation 

In this section we will prove that any AVG H has a 
realisation H such that si(H) = s(H) and ri(H) = r(H). 

A vertex connected by an adjacency to another ver- 
tex with more child branches has missing children. A 
root vertex that is connected to a non-root vertex has 
a missing parent. Missing parents and missing chil- 
dren are collectively missing branches. An unattached 
side with homologous attached sides has a missing 
adjacency. 

We will define a series of extension types that when 
combined iteratively create an extension in which all ver- 
tices are labeled and no elements have missing adjacen- 
cies or branches. For each extension type defined below 
Figure 12 shows an example. 

For an attached root vertex x, the creation of a new ver- 
tex x' and branch (xf, x) is a case 1 extension. The case 1 
extension is used iteratively to initially ensure all roots are 
unattached. 

For an attached leaf vertex, the creation of a new ver- 
tex x' and branch (x, x f ) is a case 2 extension. The case 2 
extension is used iteratively to initially ensure all leaves are 
unattached. 

For a side x a if A(x a ) is in a module M, x a is in the face of 
M. Let M be a simple module containing an odd number 
of sides and let x a be an unattached root side in the face 
of M. The following is a case 3 extension: the creation of a 



pair of vertices y and /, an adjacency connecting a side of 
y to Xqi and the branch (y,y f ). The case 3 extension is used 
iteratively to ensure all modules contain an even number 
of sides. 

Similarly to vertices and sides, a thread X is ances- 
tral to a thread Y in a history graph G, and reversely Y 
is a descendant of X, if there exists a directed path in 
D(G) from the vertex representing X to the vertex rep- 
resenting y, otherwise two threads are unrelated if they 
do not have an ancestor/descendant relationship. For a 
vertex x, T(x) is the thread it is part of. For a pair of 
unattached root sides x a and yp in the face of a sim- 
ple module such that T(x) = T(y) or T(x) and T(y) 
are unrelated, the creation of a new adjacency {x a ,yp} 
is a case 4 extension. The case 4 extension is used 
iteratively to ensure all modules contain attached root 
sides. 

Let x a be a side in the face of a simple module M such 
that Xqi IS internal, unattached and has an attached par- 
ent. Let (y, / ) be a branch such that y^ is a side in the 
face of M, T(y) is not descendant of T(x), if T(y) = 
T(x) then y is unattached, T(y f ) is descendant or unre- 
lated to T(x), and the sides A(x a ) and A(y f ^) in M are 
connected by a path containing an odd number of adja- 
cencies/lifted adjacencies. If yp is unattached and T(y) is 
unrelated or equal to T(x) then the creation of the adja- 
cency {x a , yp] is the case 5 extension, else the interpolation 
of a vertex y" on the branch (y,y f ) and creation of the 
adjacency {x a) y"p \ is the case 5 extension. The case 5 exten- 
sion is used iteratively to ensure all internal vertices are 
attached. 

For an adjacency {x a , yp} such that y has fewer chil- 
dren than x, the creation of a new vertex y' and branch 
(y,y f ) is a case 6 extension. The case 6 extension is used 
iteratively to ensure there are no vertices with missing 
children. 

Let x a and yp be a pair of unattached leaf sides in the 
face of a simple module M such that T(x) and T(y) are 
unrelated or equal, A(x a ) and A(yp) are attached and are 
either connected by an adjacency or both not incident 
with a non-trivial lifted adjacency. The creation of a new 
adjacency {x a ,yp} is a case 7 extension. The case 7 exten- 
sion is used iteratively to ensure there are no leaf vertices 
with missing adjacencies. 

For a branch-tree containing no labeled vertices, the 
labeling of any single vertex in the branch-tree with 
a member of E* is a case 8 extension. For a branch 
(x,y), such that y is labeled and x is unlabeled the 
labeling of x with the label of y is a case 9 exten- 
sion. For a branch (x,y), such that x is labeled and y 
is unlabeled the labeling of y with the label of x is 
a case 10 extension. The case 8, 9 and 10 extensions 
are used iteratively to ensure there are no unlabeled 
vertices. 
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Figure 12 Case 1 to 10 extensions. Adjacencies with lightning bolts may be expanded to include additional elements within the simple module. 
Elements in red/outlined in red are those added in extension. 



Lemma 18. For an AVG H, ifH' is obtained from H by 
any of the 10 extensions cases above then si(H) = si(H f ) 
andr l (H) = r l (H'). 

Lemma 19. For an AVG H, each of the ten types of 
extensions above can only be applied consecutively a finite 
number of times until there are no more opportunities in 
the graph to apply an extension of that type. 



Lemma 20. Any AVG H has an AVG extension H' with 
no missing labels, adjacencies or branches and such that 
si(H) = 5/(7/0 andriQf) = n(H f ). 

Proof We will demonstrate that the following algorithm 
converts an AVG into an AVG with no missing adjacencies 
or branches or unlabeled vertices. 
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H' • 

/ <r- 1 

while i < 10 do • 
while //'has a case /extension do 

H ' <— case / extension of H ' 
end while 

i <r- i + 1 

end while 



It follows from Lemma 19 that the algorithm always ter- 
minates and from Lemma 18 that H' is an AVG such that 
s l (H)=s l (H>) 2ndn(H) = n(H'). 

It remains to prove that H' has no missing branches or 
adjacencies or unlabeled vertices. Call the AVG extension 
resulting at the end of the ith loop of line 5 of the algo- 
rithm the case i complete extension. The following series 
of compounding statements are straightforward to verify. 

• The case 3 complete extension contains no modules 
with an odd number of sides. 

The case 2 extensions ensure that all root vertices are 
unattached, and every case 3 extension attaches a 
root vertex in a module with an odd number of sides 
to a newly created root vertex, so ensuring the 
module contains an even number of sides, so for 
every module with an odd number of sides there 
exists a case 3 extension. 

• The case 4 complete extension additionally contains 
no root sides with missing adjacencies or root 
vertices with missing parents. 

The case 3 extensions ensure that there always 0 or 2 
unattached root sides in a module, so any unattached 
root side in a module always has a potential 
unattached partner root side within the module. The 
requirement that sides connected in a case 4 
extension be in the same or unrelated threads prior to 
connection does not prevent any root side within the 
face of a module from becoming attached, because 
the case 1 extensions ensure that all root vertices are 
unrelated, the case 2 extensions do not effect root 
vertices and the case 3 and 4 vertices only result in 
root vertices being connected to one another. 

• The case 5 complete extension additionally contains 
no internal vertices with missing adjacencies. 

The case 4 extensions ensure that all root sides within 
modules are attached. The case 2 extensions ensure 
that all attached sides have children and the case 3, 4 
and 5 extensions ensure this remains true. Given this, 
and that every module has an even number of sides 
within it (as a case 3 complete extension), it is 
straightforward to verify that there is always a case 5 
extension in a sequence of such extensions for any 
internal side within the face of a module. 



The case 6 complete extension additionally contains 
no vertices with missing child branches. 
The case 7 complete extension additionally contains 
no leaf sides with missing adjacencies, and therefore 
has no missing branches or adjacencies. 
Analogously with the case 4 extensions, the 
requirement that sides connected in a case 7 
extension be in the same or unrelated threads does 
not prevent any leaf side within the face of a module 
from becoming attached by a case 7 extension, this is 
because the case 2 extensions ensure all leaf vertices 
are unrelated, the case 3, 4, 5 and 6 extensions do not 
connect leaf vertices, and the case 7 extensions only 
connect leaf sides to one another. 

• The case 8 complete extension additionally contains 
no branch-trees without any labeled vertices. 

• The case 9 complete extension additionally contains 
no unlabeled ancestral vertices that have labeled 
descendants. 

• The case 10 complete extension additionally contains 
no unlabeled vertices, and therefore has no missing 
adjacencies, branches or labels. 

□ 

We can now prove the desired lemma. 

Lemma 21. Any AVG H has a realisation H such that 
si(H) = s(H) and n(H) = r(H). 

Proof. Lemma 20 demonstrates there exists an AVG 
extension H f of H with no missing labels, adjacencies or 
branches such that s t (H) = si(H') and n(H) = r(H'). 
H f is converted to a simple history with the same cost as 
follows. 

• On every branch of H ' interpolate a vertex. 

• Label each interpolated vertex identically to its parent. 

• Connect the sides of the interpolated vertices to one 
another such that for any adjacency {x ai yp} 
connecting interpolated vertices, 
{A(x a ),A(y p )} eE H >. 

It is easily verified that the result is an AVG that can 
be edge partitioned into rearrangement and replication 
epochs and hence is a simple history. □ 

LBSC and LBRC are lower bounds 
Lemma 22. LBSC is a lower bound on substitution cost 

Proof. From Lemmas 17 and 21 it follows that every his- 
tory graph has a realisation. It is sufficient therefore to 
further prove that for any simple history H, 5(H) = s/(H) 
and that a history graph G has no extension G' such that 
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s/(G) > 5/(G r ). The former is easily verified and we now 
prove the latter. 

Let (G = G n ) < G n -i < ...G 2 < (Gi = G f ) be 
a sequence of n history graphs for a reduction sequence 
of n — 1 reduction operations. For some integer i e [1, n) 
if the ith reduction operation is a vertex deletion, adja- 
cency deletion or branch contraction, as these each have 
no impact on the calculation of LBSC, s/(G; + i) = s/(G;). 
Else the ith reduction operation is a label deletion. Let x 
be the vertex whose label is being deleted. As the number 
of non- trivial lifted labels for A(x) after the deletion of x is 
less than or equal to the sum of non-trivial lifted labels for 
x and A(x), it follows thats/(G; + i) < s/(G;). Therefore by 
induction 5/ (G) < 5/ (GO- □ 

Lemma 23. LBRC is a lower bound on rearrangement 
cost 

Proof. Analogously to the proof of Lemma 22, from 
Lemmas 17 and 21 it follows that every history graph has 
a realisation. It is sufficient therefore to further prove that 
for any simple history H, r(H) = r/(H) and that a history 
graph G has no extension G' such that r/(G) > 77 (G')- The 
former is easily verified and we now prove the latter. 

Let (G = G n ) < G n — 1 < . . . G2 < (Gi = G ! ) be a 
sequence of n history graphs for a reduction sequence of 
n — 1 reduction operations. For some integer i e[l,n) if 
the ith reduction operation is a label deletion, vertex dele- 
tion or contraction of a branch with a free-parent, as each 
removes an element that has no effect on the calculation 
of the LBRC,r / (G /+ i) = r / (G / ). 

Else if the ith reduction operation is a contraction of 
a branch with a free-child, as the child is unattached the 
only possible effect on the LBRC calculation is the conver- 
sion of non-trivial lifted adjacencies into trivial lifted adja- 
cencies, therefore r/(G/ + i) < r/(G/) (see Figure 13(A)). 

Let q(M) and p(M) be the number of unattached and 
attached sides in a module M, as q(M) + p(M) = Vm- 

n(G)= J2 KqM+pM)/z\-i. 

MeM(G) 

As each side may be incident with at most one adjacency 
p(M) is even and p(M)/2 is the number of adjacencies in 
M, therefore: 

r/(G) = \E G \+ J2 \«(M)/2\-l. 

MeM(G) 

Hence r/(G) = \E G \ + Q(G) - \M(G)\, where Q(G) = 
J2meM(G) \q(M)/2\. Suppose r/(G/+i) > r/(G/). If the ith 
reduction operation is an adjacency deletion, \E Gi+1 | + 
1 = \E Gi \, therefore Q(G /+ i) - \M(G M )\ > Q(G/)- 
|M(G0l+2. 



The removal of an adjacency can reduce the number of 
modules by at most two, therefore \M(G£) | — |Af (Gj+i) | < 
2. The number of modules decreases by the maximum of 
two only when the adjacency to be deleted connects two 
sides that each have no incident lifted adjacencies (see 
Figure 13(B)). However, in this case Q(G/) = Q(G/ + i) + 1, 
as the number of unattached sides in a module decreases 
by 2, therefore if |Af(G/) | - |Af (G/+i) | = 2 then r/(G/+i) < 

n(Gi). 

An unattached side in a module is the side of a free-root, 
and such a free-root side has incident lifted adjacencies. 
The side of a free-root with no incident lifted adjacencies 
can not become part of a module by the removal of any 
adjacency from the associated history graph, as by defi- 
nition the homologous sides in its associated branch-tree 
are all unattached. The removal of an adjacency can there- 
fore only decrease or leave the same the total number of 
unattached sides in modules. The only way for Q(G/ + i) — 
Q(G/) to be positive is therefore by the redistribution of 
unattached sides between modules to exploit the ceiling 
function. As in the removal of a single adjacency at most 
two unattached sides can be redistributed from a single 
module (see Figure 13(C)), therefore Q(G/+i) — Q(G/) < 1. 
But if Q(G/ + i) — Q(G/) = 1 then it is easily verified 
\M(Gi)\ - \M(Gi+i)\ < 0. This is all the cases, there- 
fore r/(Gj+i) < r/(G/), by induction therefore 77(G) < 
ri(G f ). ^ □ 

Theorem L For any history graph G and any cost func- 
tion c } c(5/(G), 77(G)) < C(G, c) < c(s u (G),r u (G)) with 
equality ifG is an AVG. 

Proof Follows from Lemmas 10, 12, 17, 21, 22 and 23. 

□ 

Proof of Theorem 2 

We first classify non-minimal adjacencies and labels. 
A non-minimal label of a vertex x is (see Figure 14(A)): 

• A leaf if U x = {}, 

• else, as it is not a junction, \L' X \ = 1 and: 

- the label is redundant \iL' x = {}, 

- else complicating if l(A(x)) 7^ l(x), 

- else l(A(x)) = l(x) and, as it is not a bridge, 
then La(x) = U an d it is an unnecessary 
bridge. 

A non-minimal adjacency {x a ,yp} is (see Figure 14(B)): 

• a leaf if I^U 1^ = 0, 

• else, as it is not a junction, neither x a or yp are 
junction sides and it is complex if\L x | > 1 or 

\L'y,\>l, 

• else |4J < 1,\L' \ < land: 
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Figure 13 Demonstrating the lower bound. (A) A contraction of a branch with a free-child can only possibly result in non-trivial adjacencies 
becoming trivial. (B) An adjacency deletion can at most reduce the number of modules by 2, and if the number of modules decrease by two then 
the number of unattached sides in a modules decreases by 2. (C) An example of the deletion of an adjacency redistributing two unattached sides. 



the adjacency is redundant if 

L Xa ULyp = {{x a ,yp}}, 

else complicating if {A(x a ),A(yp)} is a 

non- trivial lifted adjacency, 

else {A(x a ),A(yp)} is a trivial lifted adjacency 

and, as it is not a bridge either: 

unnecessary bridge, 
* ehe(L' MXa) UV Xa )\{{Xa 9 y p }} <1 
and(L' AW UL' yf} )\{{x a ,yp}}<l 
and it is a removable bridge. 



Lemma 24. A G -minimal AVG contains no G-reducible 
non-minimal elements. 



Proof. We prove the contrapositive. It is easily verified 
that the deletion of any single non-minimal vertex or con- 
traction of a non-minimal branch from an AVG results 
in a reduction that is also an AVG. It is also easily veri- 
fied that the deletion of each possible type of non-minimal 
label/adjacency from an AVG results in a reduction that 
is also an AVG, with the exceptions of a complex non- 
minimal adjacency, which can not be present within an 
AVG (because such an edge implies ambiguity), and a 
removable bridge adjacency. After deletion of a removable 
bridge adjacency {x a ,yp} the adjacency {A(x a ),A(yp)} 
ceases to be a junction adjacency, and may either become 
a bridge, in which case the resulting graph is an AVG, or 
it may become a non-minimal adjacency. If it becomes a 
non-minimal adjacency, then, by the prior argument, if 
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4 h 
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Figure 14 Further classifying labels and adjacencies. (A) Classification of labels. From left-to-right labels of vertices outlined in red are: leaf, 
junction, (another) junction, redundant, complicating, unnecessary bridge and bridge. (B) Classification of adjacencies. From left-to-right 
adjacencies in red are: leaf, junction, complex, redundant, complicating, unnecessary bridge, removable bridge and bridge. 
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it is not a removable bridge adjacency then its deletion 
results in an AVG, else if it is a removable bridge then after 
the deletion of {A(x a ),A(yp)} } the process of considering if 
{A(A(x a ))>A(A(yp))} is non-minimal and deleting if nec- 
essary is repeated iteratively until the resulting graph is an 
AVG. □ 

Lemma 25. The only G-reducible adjacencies in the G- 
unbridged graph of an extension of G containing no non- 
minimal elements are junction adjacencies. 

Proof By definition, the only G-reducible adjacencies 
in an extension of G with no G-reducible non-minimal 



elements are junction adjacencies and bridges. Each dele- 
tion of a G-reducible bridge adjacency does not create any 
G-reducible non-minimal adjacencies, as a junction adja- 
cency connecting sides that are the lifting ancestors of 
the sides connected by a bridge adjacency remains a junc- 
tion adjacency after the deletion of the bridge, and the 
lifted adjacencies incident with the sides connected by the 
bridge, which are non-trivial, lift to this junction instead 
and therefore remain non-trivial. □ 

Lemma 26. The G-unbridged graph of a G-optimal 
AVG for any cost function contains no G-reducible ping 
adjacencies. 



H 








Figure 1 5 Examples of ping-pong adjacencies. (A,B,C) The ping-pong adjacencies are shown in blue. (D,E,F) After modifications to remove the 
ping-pong adjacencies, for each corresponding left side case, with the added elements shown in red. 
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Proof. Let H be a G-optimal AVG whose G-unbridged 
graph H' contains one or more G-reducible ping adjacen- 
cies. Example subgraphs containing ping-pong adjacen- 
cies are shown in Figure 15(A-C). Let e = {x a , yp] be such 
a G-reducible ping adjacency and yp a hanging endpoint 
in H' . From Lemma 25, the adjacency e must be a junc- 
tion. Delete e from H' giving H", note G A H" . If x a 
has no most recent attached ancestor create a new ver- 
tex and connect it with an adjacency to oCqi as shown in 
Figure 15(D), else do the same but connect the new vertex 
by a branch that makes it the child of the vertex connected 
by an adjacency to the most recent attached ancestor of 

, as shown in Figure 15(E). Note that it does not matter 
in this second case if the most recent attached ancestor of 
x a is a pong adjacency, as demonstrated in Figures 15(C) 
and (F). It is easily verified that each modification defines 
the G-unbridged graph of a valid AVG extension H'" of 
G that has one fewer G-reducible ping adjacencies in its 
G-unbridged graph, one less rearrangement and the same 
number of substitutions in its most parsimonious realisa- 
tion as in the most parsimonious realisation of H. This 
contradiction to the assumption that H was G-optimal 
establishes the result. □ 

Theorem 2. The G-bounded AVGs contain the G- 
optimal AVGs for every cost function. 

Proof. Follows from Lemmas 24 and 26. □ 

Proof of Theorem 3 

In the following let n be the number of adjacencies in a 
history graph G. 

Lemma 27. If n = 0 any G-bounded extension of G 
contains 0 adjacencies. 

Proof. Follows from Lemma 24. □ 

As the n = 0 case is trivial now assume that n > 1. 
For an adjacency {x a , yp} its received incidence is \V | + 
\Ly | and its projected incidence is equal to the number 
of members of {A(x a ),A(yp)} that are attached, either 
0, 1 or 2. For an adjacency, the difference between pro- 
jected incidence and received incidence is the incidence 
transmission. A positive incidence transmission occurs 
when the projected incidence is greater than the received 
incidence number, conversely a negative incidence trans- 
mission occurs when the projected incidence is less than 
the received incidence. The incidence sum of a history 
graph is the sum of the received incidences of its adjacen- 
cies, or, equivalently, the sum of the projected incidences 
of its adjacencies. 



Lemma 28. The maximum possible incidence sum of G 
is 2n - 2. 

Proof. The 2n term is because each adjacency has a pro- 
jected incidence of at most 2, the —2 term is because at 
least one adjacency has a projected incidence of 0. □ 

It is trivial to show this bound can be achieved for all 
values of n. 

Lemma 29. The G-unbridged graph G" for a G-bounded 
history graph G' has no G-reducible adjacencies with a 
positive incidence transmission. 

Proof. By Lemma 25, the only G-reducible adjacencies 
in G" are junction adjacencies. Junction adjacencies have 
an incidence transmission of 0 or less. □ 

Lemma 30. The G-unbridged graph G" for a G-bounded 
history graph G' contains less than or equal to 2n — 1 
adjacencies that either have a negative incidence trans- 
mission, or which are G-irreducible and have an incidence 
transmission ofO. 

Proof. Let be the number of adjacencies in G" that 
have a projected incidence of i and a received incidence 
of As the sum of projected incidences equals the sum of 
received incidences therefore: 

2 co 2 co 

J2J2 iki >j = 12J2j ki >j 

i=0 ;'=0 i=0 7=0 




Figure 16 Demonstrating a larger graph. Ignoring the grey 
elements, if there exists an adjacency e in a G-bounded and 
G-unbridged graph with no projected incidences and one received 
incidence, it must be G-irreducible and a larger graph, shown by the 
elements in grey, exists that is /-/-bounded for a graph H with the 
same number of adjacencies as G. 
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Figure 17 Demonstrating Lemma 31. (A) A junction adjacency with incidence transmission of 0 in a G-bounded AVG. (B) Ignoring the grey 
elements, if the adjacency e in a G-bounded graph is a G-reducible junction adjacency with no projected incidences and no hanging endpoints, a 
larger graph, shown by the elements in grey, exists that is /-/-bounded for a graph H with the same number of adjacencies as G. 



Separating the contributions of adjacencies with a neg- The first term of the left-hand side of the equation is the 

ative incidence transmission: total incidence transmission of all adjacencies in G" with 

a positive incidence transmission. Using Lemma 29, these 

2 i 2 oo 2 i 2 oo adjacencies must all be G-irreducible. Let k be the num- 

Uqj _j_ i/qj — ^2,jkij + Ai/> ^ er °^ -irreducible adjacencies that have an incidence 

i=o j=o ' i=0j=i+i ' i=o j=o ' i=o/=i+i ' transmission of 0, as: 

2 i-1 2 oo 2/1 

E£(''- />*v = EEf- ^ v v it,- , < « - it, 

/=0/=0 !=0/=; . +1 Z^A. ./ 

2 i-l 2 oo 2 oo 2 i-l 2 i-l 

££(«•-/)*«-£ £(/-«•-!)*¥ = £ £ *v- EEci-Mv^zEE^^ 2 "- 2 * 

i=0 7=0 i=0 ;'=i+l i=0 ;'=i+l i=0 ;'=0 i=0 ;'=0 




Figure 18 Toward demonstrating Lemma 33. (A) A graph with a single junction adjacency (coloured red) and two bridge adjacencies. (B) A 
graph with the same size and cardinality as that in (A), but with an additional junction adjacency. 



Paten etal. BMC Bioinformatics 2014, 15:206 
http://www.biomedcentral.eom/1 471 -21 05/1 5/206 



Page 25 of 31 



therefore b y substitution: 

2 oo 2 oo 

i=0 ;=H-1 /=0 /=/+l 

The right-hand side of the inequality is the number of 
adjacencies with a negative incidence transmission plus 
two times the number of G-irreducible adjacencies with 
an incidence transmission of 0. 

As SLoI^i+i(/ - * - l )hj can not be negative, it 
remains only to prove that this term must be positive. 
Assume that there are 2n or more adjacencies that either 
have a negative incidence transmission, or which are G- 
irreducible and have an incidence transmission of 0 (i.e. 
a contradiction of the lemma). As n > 0, there must be 
at least one adjacency in G" with a projected incidence 
of 0 and a received incidence of greater than 0 in some 
ancestral thread (i.e. Ylj^i > 0)* Either such an edge 
has a received incidence of 2 or greater, in which case 
the considered term must be positive, or a larger graph 
exists (see Figure 16) that is //-bounded and //-unbridged 
with respect to a graph H, which has the same number of 
adjacencies as G and an extra edge with a projected inci- 
dence of 0 and a received incidence of 2 or greater, which 
implies that Yll=o k Uj + 2/< < 2n. In either case we 

derive a contradiction to the assumption of the number of 
adjacencies, therefore: 

2 oo 

2n - 1 > k i>j + 2k ' 

i=0 j=i+l 

□ 

Lemma 31. The G-unbridged graph G" of a G-bounded 
history graph G' contains less than or equal to 3n — 3 G- 
reducible adjacencies with an incidence transmission ofO. 

Proof Let X be the set of G-reducible adjacencies with 
an incidence transmission of 0 in G" . The sum of received 
incidences equals the sum of projected incidences for 
members of X, therefore the sum of received incidences 
of other adjacencies in G" (< 2n — 1 G-reducible adja- 
cencies with positive or negative incidence transmission 
by Lemmas 29 and 30 and < n G-irreducible adjacen- 
cies) is equal to the sum of their projected incidences, 
which by Lemma 28 is at most 2(n + 2n — 1) — 2 = 
6n — 4. By Lemma 25, any adjacency e = {x a ,yp} in 
X must be a junction adjacency, and, as it has 0 inci- 
dence transmission, must have a hanging endpoint and 
projected incidence of 2 (see Figure 17(A)). Let e' and e n 
be the adjacencies incident with A(x a ) and A(yp), respec- 
tively (see Figure 17(A)). As there exist no G-reducible 



ping adjacencies, e' and e n are either G-irreducible or G- 
reducible junction adjacencies with a negative incidence 
transmission. As e projects at least one incidence to each 

r^—^k 4 h h ^ 

— ~y< 
i — \ 

■Nf V 

/ h <:■ 

4 \ \ 

* — * \ \ 

w4- 6 ; 

,--'6 6 i 

B 

^f- ^ 4 

4 — \ ! 

/ V- -4i ■=> — ^ 
4 > ! 

* \ ! 

/ ^ ^ I 

4 4- 

4 — ^ \ 4 — ^ \ 
^ ^- -5=. 

Figure 19 Demonstrating the tightness of the adjacency bound 
in Lemma 33. (A) A history graph with 3 adjacencies. (B) A 
G-bounded AVG extension of the graph in (A) with 22 extra 
G-reducible adjacencies and 44 extra attached vertices, the maximum 
possible. The leaf adjacencies connecting the labeled leaves are 
G-irreducible, as there are no other labeled vertices in G' . All the other 
adjacencies are junctions or bridges. The number of elements in the 
red subgraph of (B) corresponds to the maximum number of 
adjacencies and attached vertices that can be added given the 
inclusion of the red subgraph in (A), and similarly for the blue 
subgraph. By extrapolation this demonstrates the bounds on the 
number of additional adjacencies and attached vertices are tight for 
all possible numbers of adjacencies in the original graph. 
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such adjacency, X has a cardinality at most (6n — 4)/2 = 
3n — 2. It remains to prove that it must be at least one less 
than this bound. 

Now let e be a G-reducible junction adjacency in G" 
that is contained in a thread that is ancestral or unre- 
lated to all threads that contain a G-reducible adjacency 
or label. If G" contains more adjacencies than G then such 
an adjacency must clearly exist in G" . 

If e makes projected incidences to G-irreducible adja- 
cencies then it makes projected incidences to adjacen- 
cies not in X. If e does not make projected incidences 
then it has negative incidence transmission, and either 
e is a hanging adjacency, in which case it must receive 
projected incidences from adjacencies that are not in X 
(else there exists a G-reducible ping adjacency), or e is 
not a hanging adjacency and a larger graph exists (see 
Figure 17(B)) that is //-bounded with respect to a graph 
H with the same number of adjacencies as G, in which 
case, using Lemma 30, there must be less than 2n — 1 
G-reducible negative transmission incidence adjacencies 
in G" . Therefore either there exist projected incidences 
made between adjacencies not in X or there are fewer 
than 2n — 1 G-reducible negative transmission incidence 
adjacencies in G' either way, there are fewer than 6n — 4 
projections made from adjacencies in X to adjacencies 
not in X, and as there are no projections made between 
adjacencies in X, and all adjacencies in X have a pro- 
jected incidence of 2, therefore X has cardinality less 
than 3n-2. □ 

Lemma 32. A G -bounded history graph G' contains less 
than or equal to 5n — 4 junction adjacencies. 

Proof. From Lemmas 29, 30 and 31 it follows that the 
unbridged graph of G' contains less than 5n — 4 junc- 
tion adjacencies. Extending the argument of Lemma 25, 



it is easily verified that G' contains the same number of 
junction adjacencies as its unbridged graph. □ 

Lemma 33. A G-bounded history graph G' contains less 
than or equal to 1 On — 8 G-reducible adjacencies and 20n — 
16 additional attached vertices. These bounds are tight for 
all n > 1. 

Proof Let i and / be the numbers of G-reducible junc- 
tion and bridge adjacencies in G', respectively. As bridges 
and junctions are the only G-reducible adjacencies in G r , 
i + j is equal to the total number of G-reducible adjacen- 
cies in G' . Assume that i + / > lOn — 8. From Lemma 32 
it follows that i < 5n — 4, therefore j > 5n — 4. As j > 
5n — 4, it follows from Lemma 32 there exists in G' a pair 
of G-reducible bridge adjacencies {x a ,yp}> {w a ,zp} such 
that [A(Xa),A(yp)} = {A(w a ),A(zp)} (see Figure 18(A)). 
However, in this case there exists an extension H of 
G that contains the same number of adjacencies as G' 
but one additional G-reducible junction adjacency (see 
Figure 18(B)), therefore in H the number of G-reducible 
junction adjacencies is greater than 5n — 4, a contradiction 
of Lemma 32, therefore i+ j < lOn — 8. From this bound, 
trivially, the bound of the number of additional attached 
vertices follows. Figure 19 shows both bounds are tight for 
all n. □ 

Let m be the number of labeled vertices in the history 
graph G. As with the n = 0 case, the m = 0 case is sim- 
ilarly trivial, but in terms of the number of G-reducible 
labels. 

Lemma 34. If m = 0 any G-bounded extension of G 
contains 0 labels. 

Proof. Follows from Lemma 24. □ 



^ B ^ 



/ \ \ X 
/ \ \ 



^ ^ ^ ^> 

Figure 20 Demonstrating the tightness of the vertex label bound in Lemma 35. (A) A history graph with 5 labeled vertices. (B) A G-bounded 
AVG extension of the graph in (A) with 8 extra labeled vertices, the maximum possible. As only the leaf labels have the given label colour their labels 
are G-irreducible, all the other labels in the graph are bridges or junctions. For each extra labeled leaf vertex added to (A) an extra pair of junction 
and bridge adjacencies can be added to (B), thus the bound is tight for all m. 



Paten etal. BMC Bioinformatics 2014, 15:206 
http://www.biomedcentral.eom/1 471 -21 05/1 5/206 



Page 27 of 31 



Now assume that m > 1 and that n > 0. 

Lemma 35. A G-bounded history graph G' contains less 
than or equal to 2m — 2 G-reducible vertex labels. This 
bound is tight for all m > 1. 

Proof. Let i and / be the number of junction and bridge 
labels, respectively, in G '. By Lemma 24, the total num- 
ber of G-reducible labels in G' is less than or equal to 
i + /. The number of bridges ; is less than or equal 
to the total number of child branches of vertices that 
are label junctions, which, as the connected components 
of branches are trees, is equal or fewer than two times 
the number of leaf labels minus 2, and therefore equal 
or fewer than 2m — 2. Furthermore, by definition, the 
lifting ancestor of a vertex with a bridge label has a non- 
trivial lifted label, which implies such a vertex's label is 
not a bridge, therefore / < 2m — 2 — z, therefore / + 
i < 2m — 2. Figure 20 shows this bound is tight for 
all m. □ 

We are now in a position to prove the desired theorem 
for any value of n and m. 

Theorem 3. A G-bounded history graph contains less 
than or equal to max(0, lOn — 8) G-reducible adjacencies 
and max(0, 2m — 2, 20n — 16, 20n + 2m — 18) additional 
vertices. This bound is tight for all values of n > 0 and 
m>0. 

Proof. Lemmas and 27 and 33 prove the bound on the 
number of G-reducible adjacencies, it remains to prove 
the bound on the number of additional vertices. 

Let X, Y and Z be the total numbers, respectively, 
of additional attached, labeled and both unattached and 
unlabeled vertices in G ' . 

From Lemmas 27 and 33 it follows that X < 
max(0, 20n — 16). From Lemmas 34 and 35 it follows that 
Y < max(0, 2m — 2). Combining these results X + Y < 
max(0, 2m - 2, 20n - 16, 20n + 2m- 18). 

Assume X + Y + Z > max(0, 2m - 2, 20n - 16, 20n + 
2m - 18). As X + Y < max(0, 2m - 2, 20n - 16, 20n + 
2m — 18), Z > 1. As G' contains no non-minimal 
branches, Z is a count of additional root vertices that 
are unlabeled, unattached and have two or more chil- 
dren, all of which are either labeled, attached or both. 
Using this information, it is straightforward to demon- 
strate that there exists a modified pair of history graphs 
(H,H f ) such that H has the same size and cardinal- 
ity as G, and H' is a //-bounded extension of H that 
has more labeled or attached vertices than G ' . The exis- 
tence of (H, H') contradicts either or both Lemmas 35 or 
Lemma 33. 

Figure 21 shows this bound is tight for all n and m. □ 
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Figure 21 Demonstrating the tightness of Theorem 3. (A) The 

combination of the history graphs in Figures 1 9(A) and 20(B), 
constructed by merging their root vertices. (B) A G-bounded AVG 
extension of the graph in (B) with 22 extra adjacencies and 54 extra 
vertices, the maximum possible. The colouring of the elements is 
used to demonstrate the bound is tight for any combination of m and 
n, and follows that used in Figure 1 9. 



Proof of Theorem 4 

A adjacency {x a ,yp} is old if both A(x a ) and A(yp) are 
each independently either the side of a free-root or inci- 
dent with a G-irreducible adjacency. 
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Figure 22 A sequence of label/adjacency detachments that transform a G-bounded history graph into G. Label detachments of a (A-B) 

bridge label and (B-C) junction label. Bond detachments of a (C-D) bridge adjacency and (old) (D-E-F) junction adjacencies. Elements outlined in 
red are those being removed. 



Lemma 36. For any G-bounded history graph G f not iso- label that is a bridge then its label detachment results 
morphic to G there exists a label detachment or adjacency in a G-bounded reduction (see Figure 22(A-B)). Else if 
detachment that results in a G-bounded history graph. there exists a G-reducible label it is a junction label and 

its label detachment results in a G-bounded reduction 

Proof. As G is not isomorphic to G r , G' contains one or (see Figure 22(B-C)). Else if there exists a G-reducible 
more G-reducible elements. If there exists a G-reducible adjacency that is a bridge then its adjacency detachment 



X y x 



c O-O 





Figure 23 Lateral-adjacency detachments. (A-B) The elements in red are G-reducible. (C) A not G-bounded intermediate of the lateral-adjacency 
detachment that contains a G-reducible ping adjacency. 
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results in a G-bounded reduction (see Figure 22(C-D)). 
Else there exists a G-reducible adjacency that is an 
old junction adjacency and whose adjacency detachment 
results in a G-bounded reduction (see Figure 22(D-E-F)) . □ 

The previous lemma implies that for any G-bounded 
history graph there exists a sequence of label and adja- 
cency detachments that results in G. We now seek the 
inverse, to demonstrate the existence of a sequence of 
moves to create a G-bounded AVG from any G-bounded 
history graph. 

The inverse of a label/adjacency/lateral-adjacency detach- 
ment is, respectively, a lab el/ adjacency /lateral- adjacency 
attachment. 

The graph in Figure 23(A) has no valid label/adjacency 
attachment operation that results in a G-bounded history 
graph, yet it is not an AVG, because it has an unattached 



junction side, x f tail and any adjacency attachment of x f tail 
results in the creation of a G-reducible ping adjacency. 
This motivates the need for the lateral-adjacency detach- 
ment/attachment operation, that we use to avoid the 
creation of G-reducible ping adjacencies. Notably, while 
both label detachment and adjacency detachment opera- 
tions define reductions, the result of a lateral-adjacency 
detachment, though an extension of G, is not necessarily 
a reduction of the starting graph, though it always has one 
fewer adjacency. 

Lemma 37. A G-bounded history graph G' such that 
u{G) > 0 has a label/ adjacency /lateral- adjacency attach- 
ment that results in a G-bounded history graph. 

Proof. If G' has a free-root x such that \L' X \ > 1, then 
the labeling of the root of the branch-tree whose free-root 
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Figure 24 An illustration of an infinite set of G-minimal AVG extensions. (A) G. (B) G with an added single copy of the repeating unit in bold. 
(C) G with the added terminal elements in bold. (D) A G-minimal AVG extension with 2 copies of the repeat subunit and the terminal elements that 
attach w 2 and y 2 . Vertices are identified by lowercase letters, with superscripts used to denote distinct copies of elements in the repeating subunit. 
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is x is a label attachment that results in a G-bounded 
extension that contains an additional junction label (see 
Figure 7(A-B) in the main text). Else, if G' has substitution 
ambiguity then there exists a labeled vertex with two or 
more non-trivial lifted labels for which there exists a label 
attachment that results in a G-bounded extension, which 
contains an additional bridge label (see Figure 7(B-C) in 
the main text). Else G' has rearrangement ambiguity. If G' 
has one or more unattached junction sides, let x a be such 
a side. If the most ancestral attached descendants of x a are 
not incident with hanging adjacencies then the creation 
of an isolated vertex y and adjacency {x a ,y a } is an adja- 
cency attachment that results in a G-bounded extension 
(see Figure 7(C-D) in the main text). Else there exists a 
lateral-adjacency attachment that results in a G-bounded 
history graph in which Xqi IS incident with an adjacency 
with no hanging endpoints (see Figure 7(D-E), the opera- 
tion is also an adjacency attachment in this example). Else 
G' does not have an unattached junction side, and there 
exists an attached junction side with two or more inci- 
dent non-trivial lifted adjacencies for which there exists an 
adjacency attachment that results in a G-bounded exten- 
sion that contains an additional bridge adjacency (see 
Figure 7(E-F) in the main text). □ 

Given Theorem 3, the previous lemma implies that for 
any G-bounded history graph there exists a sequence of 
label/adjacency/lateral-adjacency attachment operations 
that result in a G-bounded AVG. 

Theorem 4. The G-bounded poset is finite, has a single 
least element G, its set of maximal elements are A VGs, and 
if and only if there exists a G-bounded reduction operation 
to transform G" into G' then G' < -gG" . 

Proof That G-bounded is finite follows from 
Theorem 3. Lemma 36 implies it has a single least 
element. As a corollary of Theorem 3 and Lemma 37 it fol- 
lows that the set of maximal elements of the G-bounded 
poset are AVGs. 

It remains to prove G' < -qG" if and only if there exists 
a G-bounded reduction operation to transform G" into 
G f . The only if follows by definition. If G' <q G" but not 
G' < -qG" then there exists a G'" such that G' <q 
G'" < G G". If G" is transformed to G' by a single G- 
bounded reduction operation, to complete the proof it is 
sufficient to show that no such G'" can exist, this is easily 
verified. □ 

Proof of Lemma 9 

Lemma 9. There exists a history graph G with an infinite 
number of G -minimal extensions. 

Proof We will demonstrate there exists an infinite set 
of G-minimal AVG extensions of the history graph G 



shown in Figure 24(A). The extensions are composed of 
the repeating subgraph shown in bold in Figure 24(B) and 
the terminal elements shown in bold in Figure 24(C) that 
attach the most ancestral copies of w and y. 

Consider the AVG extension Ho with zero copies of the 
repeating subunit and the terminal elements to attach w° 
and y°, as in Figure 24(C). As a, b, c and d are labeled but 
no other vertices are labeled, the adjacencies {a^ead^ bhead] 
and {chead> dhead) are G-irreducible, because removal of 
either in any reduction would create a graph that can 
not then be an extension of G. Given this observation, 
by definition w° and y° or any vertices produced by con- 
tracting incident branches of w° and y° must be junctions 
in any G-minimal reduction, and therefore be attached, 
but by definition of the reduction relation, {w® ead , w head) 

and iyh eae i> y head} Can n0t ^ e remove( ^ an( ^ Y et w ° an< ^ 

y° be attached in any G-minimal reduction. This there- 
fore implies that the bridge adjacencies K x \ ea ^ x head} anc ^ 

[ z head> ^head) are a ^ so not remove d m a G-minimal reduc- 
tion, but this is all the adjacencies in Hq, as all the vertices 
in Ho are attached, therefore Ho is G-minimal. 

Let Hi be an AVG with i such layers, where i > 0 
(Figure 24(D) shows an example for i = 2). To prove that 
Hi is a G-minimal AVG extension we proceed by induc- 
tion. Ho is the base case. Assume the adjacencies incident 
with w l ~ l and y l ~ l are not removed in any G-minimal 
reduction. Using similar logic to the base case the adjacen- 
cies incident w\x\ y l and z l are similarly not removed in 
a G-minimal reduction, again as this is all the added adja- 
cencies and all vertices are attached, using the induction 
therefore Hi is G-minimal. □ 

Endnotes 

a The contraction of an edge e is the removal of e from 
the graph and merger of the vertices x and y incident 
with e to create new vertex z, such that edges incident 
with z were incident either with x or y or both, in the 
latter case becoming a loop edge on z. 

b Note: while 7~L(G) is infinite we show in the sequel that 
the infimum of this set of costs is always achieved by a 
history, hence the infimum is the minimum. 
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